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Comment: Classifier Technology and the 
usion of Progress — Credit Scoring 



Ross W. Gayler 



These comments support Hand's argument for the 
lack of practical progress in classifier technology by 
pursuing them a little deeper in the specific context 
of credit scoring. Academic development of model- 
ing techniques tends to ignore the role of the prac- 
titioner and the impact of business objectives. In 
credit scoring it can be seen that the nature of the 
task forces practitioners to adopt modeling strate- 
gies that positively favor simple techniques or, at 
least, limit the possible advantage of sophisticated 
techniques. The strategies adopted by credit scorers 
can be viewed as a heuristic approach to inference 
of the unobserved (and unobservable) distribution of 
possible data sets. The technical progress examined 
by Hand has been aimed toward better goodness of 
fit. However, technical progress toward a more prin- 
cipled basis for inferring the distribution of future 
problem data would be more likely to be adopted in 
practice. 

1. CREDIT SCORING 

I am approaching this commentary domain- 
specific consumer of statistical technology. My concern 
is credit scoring (the use of predictive statistical 
models to control operational decision-making in con- 
sumer finance). Classical credit scoring is applied at 
the point of application for a loan to predict the risk 
of default (nonpayment) and to make the decision 
whether to approve that application for credit. The 
total value of the loans made under the control of 
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credit scoring is immense, and the value added to 
the economy by better decision-making because of 
credit scoring is correspondingly large. Thus, credit 
scoring is a domain where improved decision-making 
due to better predictive modeling would be valuable 
and technical progress would be expected. 

Somewhat surprisingly, the statistical techniques 
currently used in credit scoring seem rather old- 
fashioned (often being simple regression models). 
This is not for lack of attempts to change the state of 
the art. New modeling techniques are regularly pro- 
posed for credit scoring (typically by academic re- 
searchers), but they are rarely adopted in practice. 
This lack of uptake cannot be blamed entirely on 
conservatism in the credit scoring community. The 
rewards of improvement are sufficiently high that 
once any lender adopts a better technique, there 
will be high competitive pressure for other lenders 
to do likewise. Rather, the continued use of simple 
predictive modeling techniques suggests that they 
have a practical advantage over more sophisticated 
techniques in credit scoring. Understanding the rea- 
sons for this advantage would be useful for the prac- 
tice of applied predictive modeling in credit scoring 
and, more generally, might suggest productive av- 
enues for the development of predictive modeling 
techniques to be applied in practical domains. 

Professor Hand has worked extensively in credit 
scoring and it is likely that his experience in that 
domain motivated the writing of his paper, although 
his thesis, as stated, is not restricted to credit scor- 
ing. As a practitioner of credit scoring, I agree with 
the points he has raised. My aim here is to examine 
Hand's points a little further in the specific con- 
text of credit scoring, looking at the interaction of 
the technicalities of modeling with the demands im- 
posed by the nature of the business task. 

A brief description of the classical credit scoring 
problem is as follows. When credit is granted to con- 
sumers, some of the borrowers will default on their 
loans. The lender typically takes a loss on a de- 
faulted loan. Ideally, a lender would predict which 



2 



R. W. GAYLER 



applicants would default and decline their applica- 
tions for credit, thus avoiding the loss. The lender 
uses data available at the time of application to 
make that prediction and decision. The data may 
come from an application form, a credit bureau and 
the lender's own records if the applicant is an exist- 
ing customer. 

The potential predictors available at the time of 
application are not causally related to the outcome 
of default. Consequently, credit scoring models are 
correlational rather than causal. The outcome of de- 
fault is not just dependent on the characteristics 
of the borrower, but also on external factors such 
as subsequent lender management actions and the 
state of the economy. Furthermore, the data are pro- 
cessed by the operational systems of lenders. These 
systems are constructed with the primary objective 
of carrying out the operational actions. Data collec- 
tion and data quality issues that are relevant to sta- 
tistical modeling are often an afterthought in system 
design (if they are considered at all) . Consequently, 
the data quality is often not what would be desired, 
and data quality problems can be quite dynamic, 
because changes are made to the systems to accom- 
modate short term operational needs. The data are 
noisy, and the quality of the noise is subject to drifts 
and jumps. 

2. REGRESSION RATHER THAN 
CLASSIFICATION 

Given that the occurrence of default is a binary 
outcome, it seems natural to treat credit scoring as 
a classification problem, and many academic papers 
have done so. Assuming a classification framework 
comes close to assuming that there is some ideal 
predictor space in which the outcome classes are 
perfectly separated. Even if such a predictor space 
does actually exist, it is not available to the credit 
scoring practitioner. The available predictors are not 
causally related to the outcome and some predictors 
(e.g., account management actions and changes in 
the economy) are not available at the time of the ap- 
plication because they occur subsequently. For prob- 
lems such as this, as Hand notes more generally, "the 
Bayes error rate is high: meaning that no decision 
surface can separate the distributions of such prob- 
lems very well" (Section 2.3). Given that the out- 
come classes cannot be separated, it may be better 
to adopt a regression framework for modeling and 
predict the probability of default conditional on the 
predictors. 



However, in credit scoring there is an even more 
important consideration than the match between 
the theoretical form of the model and the true state 
of affairs. Lenders need to be able to control the rate 
at which loan applications are declined. This allows 
them to adjust workloads and to control the trade- 
off of profit against volume of business. A classifica- 
tion model yields predictions of "default" or "repay" 
which are mapped to decisions to "decline" or "ac- 
cept" the loan application. Consequently, the decline 
rate is fixed by the predictions and the lender has no 
direct control of the decline rate from a classification 
model. This illustrates the point that credit scoring 
practitioners need to be mindful of the operational 
requirements of lending over and above goodness of 
fit and the theoretical form of models. 

Hand's paper is written in terms of classifiers, but 
his arguments apply just as well to regression mod- 
els used as classifiers. A regression model may be 
trivially converted to a classifier by having the pre- 
dicted outcome be the probability of class member- 
ship and comparing it to a threshold. In fact, this 
is the standard form of credit scoring models. Con- 
versely, some classification models can be converted 
to adequate regression models, but this is not gener- 
ally true. A decision tree with two leaves will never 
make a good regression model. Consequently, even 
though classification models are not well suited to 
credit scoring, Hand's arguments do apply to credit 
scoring as it is practiced. 

3. EQUIVALENCE OF MODELS AND 
DEGREES OF FREEDOM IN THE MODELER 

Hand observed that "a tremendous variety of algo- 
rithms and models has been developed for the con- 
struction of such [classification] rules" (Section 1). 
Different algorithms have different representational 
biases and a different bias/variance trade-off. For 
a fixed set of predictors we would expect different 
algorithms to generate different approximations to 
the outcome. However, in credit scoring the set of 
predictors is not fixed. The model developer is free 
to generate new derived variables in the data set 
and will generally do so to accommodate the partic- 
ular representational bias of the modeling technique 
used. For example, decision tree induction and pro- 
jection pursuit regression are able to automatically 
model interactions in the data, whereas regression 
works only with the predictors it is given and does 
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not create interactive combinations. The credit scor- 
ing modeler using regression would construct inter- 
action predictors if they were thought necessary. 

The objective of every modeling technique is to 
approximate the data. Thus, in the limit (and the 
hands of a skilled modeler), every modeling tech- 
nique should end up in agreement because they are 
all approximating the same data. However, the ef- 
fort required to achieve that degree of approxima- 
tion may vary greatly between techniques. Even for 
techniques that require the same effort to achieve 
a given accuracy of approximation, the models may 
differ in other properties that are operationally im- 
portant to the lender. 

It is also worth recalling Hand's comment about 
the high Bayes error rate (Section 2.3). When the 
ratio of variance accounted for by the response sur- 
face is low compared to the error about the response 
surface (as it is in credit scoring), it becomes harder 
to distinguish between different representational bi- 
ases. Thus we would not expect the differences be- 
tween different modeling techniques to be readily 
observable. 

The impact of the skilled modeler warrants some 
further investigation. Effectively, the modeler sup- 
plies extra degrees of freedom in addition to those 
supplied by the modeling technique. The natural 
consequence of this is to reduce the difference be- 
tween techniques in terms of goodness of fit. Rather 
than compare modeling techniques in terms of pre- 
dictive power, it would be more useful to look at 
the effort required of the modeler to achieve a given 
goodness of fit and other properties of the models 
that are of operational relevance to the lender. 

4. MAIN EFFECTS AND INTERACTIONS 

In Section 2.3, Hand mentions "examples of ar- 
tificial data which simple models cannot separate 
(e.g., intertwined spirals or checkerboard patterns)," 
noting that "such data sets are exceedingly rare in 
real life [and] it is common to find that the cen- 
troids of the predictor variable distributions of the 
classes are different." This is a claim that problems 
which can be modeled only as interactions of the 
variables (with no observable main effects) are rare. 
This may well be true in general because of the im- 
probability of interactions exactly canceling out to 
leave no main effects. However, in credit scoring it 
is also true for domain-specific reasons. The inclu- 
sion of each predictor in a decision-making system 



has to be justified (operationally and legally). It is 
much easier to argue for the inclusion of a predictor 
if the argument can be made for that predictor in 
isolation. Conversely, it is harder to argue for the 
inclusion of a predictor if it can be shown to add 
value only in the context of other predictors. 

Furthermore, credit scoring practitioners are very 
concerned with the stability over time of their mod- 
els. Some credit scoring models are used for years 
before being replaced. Therefore, it is important to 
ensure that the predictive relationships on which the 
model is based are stable over time. Credit scoring 
practitioners tend to believe that main effects are 
more stable than interactions (all other things be- 
ing equal) . When interactions are included as predic- 
tors, it is generally because the modeler has a prior 
belief that the interaction reflects some stable mech- 
anism in the world. An otherwise unmotivated in- 
teraction that is discovered by an automated search 
procedure is unlikely to be included in a predictive 
model or, if it is included, to have its influence in- 
tentionally limited relative to the main effects. The 
effect of these selection biases is to ensure that credit 
scorers prefer simpler models based on main effects. 

5. SENSITIVITY TO ARBITRARY 
MODELING DECISIONS 

Hand notes that when constructing classification 
rules, "various . . . assumptions and choices are of- 
ten made which may not be appropriate" (Section 1) 
and even when they are entirely appropriate, the 
choices may be somewhat arbitrary. He gives the 
example of typically defining "a customer as 'de- 
faulting' if they fall three months in arrears with re- 
payments . . . [while] [i]t is entirely reasonable that 
alternative definitions (e.g., four months in arrears) 
might be more useful if economic conditions were to 
change" (Section 4.2). Credit scoring necessarily in- 
volves many detailed decisions concerning the mod- 
eling process. Many of these decisions involve com- 
promises and trade-offs, with no obviously correct 
answer. While the experienced credit scorer would 
have arguments for the specific decisions made, it would 
be a bold modeler who would argue that the de- 
cisions taken were uniquely and obviously correct. 
Thus, there is an element of arbitrariness in the 
modeling process. 

It is possible to conceive of a space of feasible mod- 
eling decisions. Similar sets of decisions are nearby 
in that space. A small change in the modeling deci- 
sions would generally lead to a small change in the 
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models. However, the possibility exists that a small 
change in modeling decisions may lead to a large 
change in the models that arise from them. This 
would be very unsatisfactory in credit scoring be- 
cause the results of the modeling would be strongly 
dependent on arbitrary modeling choices. Therefore, 
credit scorers tend to restrict their attention to re- 
gions of the modeling decision space where the gra- 
dient of models with respect to modeling decisions 
is low. In these regions, all the models generated as 
a result of the different modeling choices would yield 
similar results. If a new modeling technique yielded 
markedly different results, it would be unlikely to be 
favored by credit scorers unless it was surrounded by 
a region of other models yielding similar results. It 
would be more difficult for the modeler to argue for 
the correctness of the unique results given that the 
choice of modeling technique might be regarded as 
arbitrary. 

6. DEVELOPMENT DATA NOT 
REPRESENTATIVE OF OPERATION 

Hand points out "that in many . . . real classifica- 
tion problems the data points in the design set are 
not . . . randomly drawn from the same distribution 
as the data points to which the classifier will be ap- 
plied" (Section 1). Furthermore, any design set rep- 
resents "merely a single . . . problem drawn from a 
notional distribution of problems" (Section 1). Later 
he notes that "a fundamental assumption of the clas- 
sical paradigm is that the various distributions in- 
volved do not change over time . . . [although this 
assumption] is unrealistic in most commercial ap- 
plications, concerned with human behaviour" (Sec- 
tion 3.1). His concern here is with population drift. 
This would not be a problem if the predictive model 
were the "true" model, but as Hand states "it would 
be a brave person who could confidently assert that 
[this] held" (Section 3.2). 

Population drift is a particular concern in credit 
scoring. Loans which default do so over an extended 
period after the loan has been granted. Consequently, 
an extended outcome period (typically at least one 
year) is required to allow a reasonable proportion of 
loans to default. To this must be added time to accu- 
mulate enough applications to provide a reasonable 
number of observations for modeling and to allow for 
seasonal variation in the applicant population. Al- 
lowing time for data preparation, data modeling and 
implementation of the models into the operational 



system, it is common for the oldest data on which a 
model is based to be three years old when the model 
is first switched on. Then the model may be in use 
for some while (three years is common, and more 
than five years not unknown) . Even if the applicant 
population distribution is stationary, the data col- 
lecting process is subject to random jumps, because 
lenders may change their systems and procedures at 
any time. Thus, a large part of the value added by 
credit scoring practitioners comes from anticipating 
possible future shifts in the data distribution and 
designing the models to be relatively insensitive to 
such shifts. This can be seen as another aspect of at- 
tempting to reduce the sensitivity of the models to 
arbitrary features of the specific design set (in this 
case, characteristics of the data that just happen to 
hold at the time the data are collected). 

The expertise of the credit scoring modeler can 
be thought of as applying a bias to the modeling 
techniques to move the models toward the notional 
distribution of problems. For example, Hand dis- 
cusses the application of a tree model and linear 
discriminant analysis (as competing techniques) to 
consumer credit data, and points out that because 
the design set is always retrospective, the population 
may have drifted by the time the model is built and 
"reduced any advantage that the more sophisticated 
tree model may have" (Section 3.1). A tree model 
fits better than linear discriminant analysis, but de- 
grades more rapidly. There is the possibility that 
the tree model may actually become worse than the 
linear discriminant model with the passage of time. 
Rather than view the techniques as competing, a 
credit scorer might model the data with linear dis- 
criminant analysis and then build a tree model of the 
residuals. This hybrid model puts a bound on dete- 
rioration by predicting the majority of the outcome 
variance using the more stable modeling technique. 

7. FREEDOM VIA THE FLAT MAXIMUM 
EFFECT 

Hand mentions the flat maximum effect in the 
context of explaining that a reasonable fraction of 
the maximum attainable predictive power can be ob- 
tained from an equally weighted combination of pre- 
dictors (Section 2.4). The existence of the flat max- 
imum effect is a great advantage in credit scoring. 
It implies that there may be many alternative mod- 
els with similar goodness of fit. This provides the 
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credit scoring modeler the opportunity to choose be- 
tween those models on some basis other than good- 
ness of fit (e.g., susceptibility to population drift or 
ability to finely control the decline rate). The free- 
dom this confers is so valuable that credit scoring 
modelers prefer to choose predictors that make the 
flat maximum effect more likely to exist. This is the 
case where there is a conditional monotone relation- 
ship between each of the predictors and the out- 
come (which also happens to be the circumstances 
under which a simple linear combination is likely to 
perform well). 

8. VALUE ADD AND MODELING 
TECHNIQUES 

In credit scoring, much of the value added by 
modelers is not via goodness of fit to the develop- 



ment sample, but by anticipation of possible changes 
in the operational systems and data. This can be 
viewed problem of trying to infer the unob- 

served distribution of possible development data sets. 
Credit scorers attempt to achieve this by biasing 
their models toward simple models and techniques. 
These models are not only more likely to generalize 
across potential data sets, but also, as Hand points 
out, to yield most of the predictive power of more 
complex models. More complex models of the cur- 
rent data set are unlikely to be attractive to credit 
scorers. However, techniques that provide a more 
principled basis for generalizing to the distribution 
of possible data sets would be welcome. 



